

## Quantum Integer Programming

47-779

**Novel Ising Solvers** 

Carnegie Mellon University Tepper School of Business William Larimer Mellon, Founder





- Conventional vs. Natural Computing
- Solving the 2D regular Ising Problem
  - Graphic Processing Units
  - Tensor Processing Units
  - Field-programmable gate arrays
- Solving general Ising models
  - Graphic Processing Units
  - Simulated Bifurcation Machine
  - CMOS
  - Digital Annealers

#### Carnegie Mellon University

Tepper School of Business



## **Conventional (Von Neumann) vs. Natural** Computing



2D Ising model - Simple yet interesting

Main concern: How to parallelize



Arbitrary Ising - Applicable but hard!



Main concern: How to actually solve

Monte Carlo Simulations [1] A 20k-Spin Ising Chip to Solve Combinatorial Optimization Problems With CMOS Annealing. Yamaoka, Yoshimura, Hayashi, Okuyama, Aoki, and Carnegie Mellon University

Tepper School of Business

[2] https://arxiv.org/pdf/1807.10750.pdf



## **Specialized Hardware for Ising/QUBO**

#### **GPUs and TPUs**





## Complementary metal-oxide semiconductors (CMOS)





#### **Digital annealers**





#### Carnegie Mellon University Tepper School of Business

[1]https://arxiv.org/pdf/1807.10750.pc [2]https://arxiv.org/pdf/1903.11714.pc [3]https://arxiv.org/pdf/1806.08815.pc [4]https://spectrum.ieee.org/tech-talk

mos-digital-annealer-produces-quantum-computer-speeds

[5]https://science.sciencemag.org/content/sci/354/6312/614.full.pdf

William Larimer Mellon, Founder

#### **Oscillator Based Computing**



Fiber beamsplitter



## **Graphical Processing Units (GPU)**

#### CPU vs GPU

| CPU                                    | GPU                                    |
|----------------------------------------|----------------------------------------|
| Central Processing Unit                | Graphics Processing Unit               |
| Several cores                          | Many cores                             |
| Low latency                            | High throughput                        |
| Good for serial processing             | Good for parallel processing           |
| Can do a handful of operations at once | Can do thousands of operations at once |



3.



The Difference between a CPU and GPU

| CPU | GPU |
|-----|-----|

Specialized, electronic circuit designed to rapidly manipulate and alter memory to accelerate the creation of images.... Their highly parallel structure makes them more efficient than general-purpose central processing units (CPUs) for algorithms that process large blocks of data in parallel.

#### **Carnegie Mellon University** Tepper School of Business

1. https://blogs.nvidia.com/blog/2009/12/16/whats-the-difference-between-a-cpu-and-a-gpu/ 2.

https://www.guora.com/Would-you-rather-fight-100-duck-sized-horses-or-one-horse-sized-duck

https://en.wikipedia.org/wiki/Graphics processing unit



## **Graphical Processing Units**

Algorithm 1 Modified Ising annealing algorithm

- 1: input:  $(M, N, \mathbf{S})$
- 2: initialize all  $\sigma_i$  in S
- 3: for sweep-id in  $\{1, 2, ..., M\}$  do
- 4: for  $\sigma_i$  in S do

5: 
$$\sigma_i \leftarrow \operatorname{argmin}(H(\sigma_i))$$
 based on  $\mathcal{H}_i(\sigma_i) = \left(-\sum J_{i,j}\sigma_j - h_i\right)\sigma_i$ 

- 6: end for
- 7: randomly choose and flip N spin glasses in **S**
- 8: decrease N
- 9: end for

Algorithm 2 GPU Simulated Annealing method for Ising model

input:  $(F_p, \mathbf{S})$ initialize ALL  $\sigma_i$  in  $\mathbf{S}$ while  $F_p > 0$  do for all  $\sigma_i \in \mathbf{S}$  in parallel do  $\sigma_i \leftarrow \operatorname{argmin}(H(\sigma_i))$ flip  $\sigma_i$  with probability  $F_p$ end for reduce  $F_p$ 



1. https://arxiv.org/pdf/1807.10750.pdf William Larimer Mellon, Founder

Device Grid 1 Block Block Block (0,0)(2,0)(1,0)Kernel 1 Block Block. Block (0,1)(1,1)(2,1)Grid 2 Kernel 2 Block (1,1) Thread Thread Thread Thread Thread (1,0)(2,0)(3,0)(4,0)(0.0)Thread Thread Thread Thread Thread (0,1)(1,1)(2,1)(3,1)(4,1)Thread Thread Thread Thread Thread (0,2)(1,2)(2,2)(3,2)(4,2)

6

Tepper School of Business



### **GPU for 2D Ising Models**



Encode spins into threads and classify +1 and -1 in separate buckets

Figure 1: (Color online) The spin lattice is processed by a variable number of blocks (a), where each block runs a variable number of threads (b). The threads update the spin lattice in two steps, *A* and *B*, using two kernel invocations (c).

## Perform the Ising Update via easily parallelizable operations.

|                              | Spinflips per $\mu$ s | Relative speed |  |
|------------------------------|-----------------------|----------------|--|
| CPU simple                   | 26.6                  | 0.11           |  |
| CPU multi-spin coding        | 226.7                 | 1.00           |  |
| shared memory                | 4415.8                | 19.50          |  |
| shared memory (Fermi)        | 8038.2                | 35.46          |  |
| multi-spin unmodified        | 3307.2                | 14.60          |  |
| multi-spin coding on the fly | 5175.8                | 22.80          |  |
| multi-spin coding linear     | 7977.4                | 35.20          |  |



Figure 2: (Color online) (a) The way a kernel processes a  $4 \times 4$  meta-spin. (b) Spins are extracted into shared memory and an update pattern is created (c). (d) Afterwards, the new spins are obtained using the update pattern (Spins on blue sites will be flipped), and written back to global memory.

1. https://arxiv.org/pdf/1007.3726.pdf

William Larimer Mellon, Founder



**Carnegie Mellon University** Tepper School of Business



#### **Correctness of GPU's results**



**Fig. 5.** Binder cumulant  $U_4$  in dependence of  $k_BT$  for various numbers *n* of spins per row and column of the two dimensional square lattice Ising model. n/2 corresponds to the involved number of threads per block on the GPU implementation. The curves of the Binder cumulants for various system sizes  $N = n^2$  cross almost perfectly at the critical temperature derived by Onsager [3], which is shown additionally as a dashed line. In each temperature step, the average was taken over  $10^7$  measurements.

It (sort of) matches Onsager analytical prediction!

**Carnegie Mellon University** Tepper School of Business

1. https://arxiv.org/pdf/1007.3726.pdf William Larimer Mellon, Founder





#### **Parallelizing GPUs**







Figure 4: (Color online) (a) Each GPU processes a "meta-spin" lattice of size  $N = n^2$ . The lattices are aligned on a super-lattice, and the outer borders are connected via periodic boundary conditions. In this example, 4 GPUs process a system of  $2^2 \cdot N$  spins. (b) A meta-spin update needs the 4 nearest neighbor meta-spins. On the borders of a lattice, each GPU needs the spin information of the neighboring lattices. The border information has to be passed between the GPUs. In our implementation this is done by using 8 neighbor arrays.

Figure 5: (Color online) Cluster performance for various system sizes (per GPU). For more than one GPU, spin flip performance scall linearly with the amount of GPUs. Again, optimal performance is reached at a lattice size of about 4096 × 4096 per GPU. Using 64 performance of 206 spinflips per nanoscenod can be achieved on a 800.000 × 800.000 lattice.

## Using 64 GPUs performance of 206 spinflips per **nanosecond** can be achieved on a 800,000x800,000 lattice

Carnegie Mellon University Tepper School of Business

1. https://arxiv.org/pdf/1007.3726.pdf William Larimer Mellon, Founder



## **Tensor Processing Units (TPU)**



Cloud TPU v2 180 teraflops 64 GB High Bandwidth Memory (HBM)



Cloud TPU v3 420 teraflops 128 GB HBM



Cloud TPU v2 Pod 11.5 petaflops 4 TB HBM 2-D toroidal mesh network



Cloud TPU v3 Pod 100+ petaflops 32 TB HBM 2-D toroidal mesh network

#### Machine learning performance and benchmarks



ResNet-50 Training Cost Comparison

**Tensor Processing Unit** (**TPU**) is an AI accelerator application-specific integrated circuit (ASIC) developed by Google specifically for neural network machine learning, particularly using Google's own TensorFlow software.

- **Carnegie Mellon University** Tepper School of Business
- 1. <u>https://en.wikipedia.org/wiki/Tensor\_Processing\_Unit#/media/File:Tensor\_</u> Processing\_Unit\_3.0.jpg
- 2. https://cloud.google.com/tpu



## **TPU for 2D Ising**

Checkerboard Algorithm: break lattice in sublattices and group equal spins to easily operate on them



Figure 3: A 2-d checkerboard: (1) Original checkerboard: on the left, the  $16 \times 16$  board is split into a  $4 \times 4$  grid of  $4 \times 4$  sub-lattices, i.e., it is represented by a [4, 4, 4, 4] tensor, where [l, k, :, :] is the sub-lattice at [l, k] of the grid; on the right, the sub-lattice is zoomed in and the indices of its spin sites are shown; (2) Reorganized checkerboard: one the left, each  $4 \times 4$  sub-lattice is reorganized by 4 "compact"  $2 \times 2$  sub-lattices; on the right, 4 "compact"  $2 \times 2$  sub-lattices are zoomed in and their original indices from the  $4 \times 4$  sub-lattice are shown. In general, such alternate coloring of black and white can be extended to lattices with any dimensions.

#### Carnegie Mellon University

Tepper School of Business

1. https://arxiv.org/pdf/1903.11714.pdf



#### **Correctness of TPU's results**



1. https://arxiv.org/pdf/1903.11714.pdf

**Carnegie Mellon University** Tepper School of Business



## **Efficiency of TPU cluster**

| lattice size $n^2$   | (flips/ns) | (nJ/flip) | ]                                              |         |              |         |
|----------------------|------------|-----------|------------------------------------------------|---------|--------------|---------|
| $(20 \times 128)^2$  | 8.1920     | 12.2070   | ]                                              |         |              |         |
| $(40 \times 128)^2$  | 9.3623     | 10.6811   | 13.0<br>12.5<br>12.0                           | -       | -            |         |
| $(80 \times 128)^2$  | 12.3362    | 8.1062    | v 12.0                                         |         |              |         |
| $(160 \times 128)^2$ | 12.8266    | 7.7963    | 511.0                                          |         |              |         |
| $(320 \times 128)^2$ | 12.9056    | 7.7486    | SU 11.0<br>SU 11.0<br>JU 10.0<br>JU 9.5<br>9.0 |         |              |         |
| $(640 \times 128)^2$ | 12.8783    | 7.7650    | 8.0                                            |         |              |         |
| GPU in [23, 3]       | 7.9774     | <u></u>   | 28                                             | -28     | - 28         |         |
| Nvidia Tesla V100    | 11.3704    | 21.9869   | 20×128<br>40×128<br>80×128                     | 160×128 | 320x128      | 640x128 |
| FPGA in [20]         | 614.4      | —         | Ţ                                              | -       | lattice size | 0       |

Better performance and less energy consumption than Nvidia GPUs, until... (next slide)

Really far from Field-programmable gate array (FPGA)! (a couple slides more)

Code available in Github and replicable results (with a Google Cloud account) <u>https://github.com/google-research/google-research/blob/master/simulation\_research/ising\_model/ising\_mcmc\_tpu.ipynb</u>

1. https://arxiv.org/pdf/1903.11714.pdf

#### **Carnegie Mellon University** Tepper School of Business





## Nvidia's Rebuttal!



Table 2: Flips per nanosecond obtained by the optimized multi-spin code on a single Tesla V100-SXM card with different lattice sizes, requiring an amount of memory ranging from 2MB to 30GB. For comparison purposes, the table also reports the best timings with 1 and 32 TPUv3 cores from [7], and with 1 FPGA from [8].



# Field-programmable gate array (FPGA)





15

A **field-programmable gate array** (**FPGA**) is an integrated circuit designed to be configured by a customer or a designer after manufacturing – hence the term "field-programmable"... Circuit diagrams were previously used to specify the configuration, but this is increasingly rare due to the advent of electronic design automation tools.

2.

nttps://en.wikipedia.org/wiki/Field-programmable\_gate\_array

#### Carnegie Mellon University

Tepper School of Business

William Larimer Mellon, Founder

https://arxiv.org/pdf/1602.03016.pdfu



## **FPGA for 2D Ising Models**



Checkerboard Algorithm diagram

| Platform      | # updated spins | Ratio |
|---------------|-----------------|-------|
| CPU           | 62              | 1     |
| Single GPU    | 7977            | 129   |
| Previous FPGA | 94127           | 1518  |
| 64 GPUs       | 206000          | 3322  |
| Our FPGA      | 614400          | 9909  |

Number of spinflips per microsecond for the 1024x1024 lattice

#### 

Circuit Diagram Single Spin



Circuit Diagram Random Number Generation

#### **Carnegie Mellon University** Tepper School of Business

1. https://arxiv.org/pdf/1602.03016.pdf

William Larimer Mellon, Founder

16 🤇







1. https://arxiv.org/pdf/1602.03016.pdf

#### **Carnegie Mellon University** Tepper School of Business



### **Working with general Ising Models**



Main concern: How to actually

solve NP-Hard Problem?

Conventional computing (Von Neumann architecture) Natural computing Natural computing

Using a natural computing approach you would ideally use Adiabatic Quantum Computing, and realistically Quantum Annealing

Schrödinger equation

$$i\eta \frac{d}{dt} |\psi\rangle = H |\psi\rangle$$
  $H(\sigma_1, \sigma_2, \cdots, \sigma_n) = -\frac{1}{2} \sum_i \sum_j J_{i,j} \sigma_i \sigma_j + \sum_i h_i \sigma_i$ 

One cannot efficiently solve this equation using classical computers (if so, why would we need quantum computers after all!)

The issue then relies on (classically) Simulating Quantum Annealing

[1] A 20k-Spin Ising Chip to Solve Combinatorial Optimization Problems With CMOS Annealing. Yamaoka, Yoshimura, Hayashi, Okuyama, Aoki, and Carnegie Mellon University Mizuno



[2] https://arxiv.org/pdf/1807.10750-pd

William Larimer Mellon, Founder

Tepper School of Business

### **Graphical Processing Units**

Algorithm 1 Modified Ising annealing algorithm

- 1: input:  $(M, N, \mathbf{S})$
- 2: initialize all  $\sigma_i$  in **S**
- 3: for sweep-id in  $\{1, 2, ..., M\}$  do
- 4: for  $\sigma_i$  in S do

5: 
$$\sigma_i \leftarrow \operatorname{argmin}(H(\sigma_i))$$
 based on  $\mathcal{H}_i(\sigma_i) = \left(-\sum J_{i,j}\sigma_j - h_i\right)$ 

- 6: end for
- 7: randomly choose and flip N spin glasses in **S**
- 8: decrease N
- 9: end for

Algorithm 2 GPU Simulated Annealing method for Ising model

input:  $(F_p, \mathbf{S})$ initialize ALL  $\sigma_i$  in  $\mathbf{S}$ while  $F_p > 0$  do for all  $\sigma_i \in \mathbf{S}$  in parallel do  $\sigma_i \leftarrow \operatorname{argmin}(H(\sigma_i))$ flip  $\sigma_i$  with probability  $F_p$ end for reduce  $F_p$ end while

Tepper School of Business

William Larimer Mellon, Founder

"One may notice that since each spin glass may have a different number of neighbors, then the threads will not be perfectly load balanced."



https://arxiv.org/pdf/1807.10750 .pdf Cook, Zhao, Sato, Hiromoto



19

#### **GPU Performance**

|         | Days      | Seconds             |
|---------|-----------|---------------------|
| # edges | CPLEX cut | GPU cut (%accuracy) |
| 9999    | 9473      | 8884 (93.78%)       |
| 14999   | 13357     | 12776(95.65%)       |
| 24998   | 20206     | 19981(98.88%)       |
| 49995   | 35248     | 36228(100.29%)      |
| 39998   | 33605     | 32914(97.94%)       |
| 59997   | 46371     | 46510(100.29%)      |
| 99995   | 70566     | 72009(102.04%)      |
| 199990  | 128448    | 131930(102.71%)     |
| 249995  | 176556    | 179391(101.60%)     |
| 374993  | 248505    | 255078(102.64%)     |
| 626988  | 392912    | 400540(101.94%)     |
| 1249975 | 741709    | 751050(101.25%)     |



## Carnegie Mellon University

Tepper School of Business

William Larimer Mellon, Founder

https://mathworld.wolfram.c sGridGraph.html

CENTER

#### **Simulated Bifurcation Machine**

"The method in [previous slide] ignores the data dependencies to implement parallel computation on fully connected spin models. Since the modified algorithm in [previous slide] does not follow the mathematical model that the Quantum Monte Carlo is based on, the output of the simulation could deviate from the optimum."

How can we efficiently simulate quantum annealing? We can take a classical approximation

0.5 E(t)

1 (t) 0.5 (t)

1.

Equations model the bifurcation (Anil's lecture)

-1.5-1-0.5 0 0.5 1 1.5

-15-1-05005

-1.5-1-0.5 0 0.5 1 1.5

150

100 150

150

100

200

200

 $H_{SB}(\vec{x}, \vec{y}, t) = \sum_{i=1}^{N} \frac{\Delta}{2} {y_i}^2 + V(\vec{x}, t)$ 

$$V(\vec{x},t) = \sum_{i=1}^{N} \left[ \frac{K}{4} x_i^4 + \frac{\Delta - p(t)}{2} x_i^2 \right] - \frac{\xi_0}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} J_{i,j} x_i x_j$$

$$\frac{\partial x_i}{\partial t} = \frac{\partial H_{SB}}{\partial y_i} = \Delta y_i$$

$$\frac{\partial y_i}{\partial t} = -\frac{\partial H_{SB}}{\partial x_i} = -[Kx_i^2 - p(t) + \Delta]x_i + \xi_0 \sum_{j=1}^N J_{i,j}x_j$$

21

**Carnegie Mellon University** <sub>2.</sub> Tepper School of Business

-0.5

0.5

А

o(t)

 $\epsilon_1(t), y_1(t)$ 

 $_{2}^{2}(t), y_{2}(t)$ 

0

1.5

Waidyasooriya, Hasitha, and Masanori Hariyama. "Highly-parallel FPGA accelerator for simulated quantum annealing." IEEE Transactions on Emerging Topics in Computing (2019). Goto, Hayato, Kosuke Tatsumura, and Alexander R. Dixon. "Combinatorial optimization by simulating adjust

bifurcations in nonlinear Hamiltonian systems." Science advances 5.4 (2019): eaav2372. William Larimer Mellon, Founder





#### **Simulated Bifurcation Machine**

Authors implemented algorithm in FPGA to solve up to 20,000 nodes fully connected graphs



 $\begin{array}{c} \mathbf{B} & 40 \\ \mathbf{S} \\ \mathbf{S}$ 

Authors implemented algorithm in CPU and GPU to solve up to 1'000,000 nodes fully connected graphs



# TOSHIBA aws

- 1. Goto, Hayato, Kosuke Tatsumura, and Alexander R. Dixon. "Combinatorial optimization by simulating adiabatic bifurcations in nonlinear Hamiltonian systems." Science advances 5.4 (2019): eaav2372.
- 2. http://www.toshiba-sol.co.jp/en/pro/sbm/index.htm

22 🤇

#### Carnegie Mellon University Tepper School of Business

## **Complementary metal-oxide semiconductors (CMOS)**





1.



CMOS Static RAM (SRAM) Circuits

Yamaoka, Masanao, et al. "A 20k-spin Ising chip to solve combinatorial optimization problems with CMOS annealing." IEEE Journal of Solid-State Circuits 51.1 (2015): 303-309.

Carnegie Mellon University CMOS Tepper School of Business

William Larimer Mellon, Founder

23

## **Complementary metal-oxide semiconductors (CMOS)**





Each Spin has 5 neighbors (Up, Down, Right, Left, Front)

1.

Spin implementation as logic gates

Use low voltage to indice random errors in SRAM and jump local minima

Yamaoka, Masanao, et al. "A 20k-spin Ising chip to solve combinatorial optimization problems with CMOS annealing." IEEE Journal of Solid-State Circuits 51.1 (2015): 303-309.

Carnegie Mellon University CMOS Tepper School of Business

William Larimer Mellon, Founder

24 <u>C</u>

# 7

## **Complementary metal-oxide semiconductors (CMOS)**



#### Actual Chip and Specs

| Items                                            | Value                                                                                                              |
|--------------------------------------------------|--------------------------------------------------------------------------------------------------------------------|
| Number of spins                                  | 20k (80 × 256)                                                                                                     |
| Process                                          | 65 nm                                                                                                              |
| Chip area                                        | $4 \times 3 = 12 \text{ mm}^2$                                                                                     |
| Area of spin                                     | $11.27 \times 23.94 = 270 \mu m^2$                                                                                 |
| Number of<br>SRAM cells                          | 260k bits<br>Spin value: 1 bit<br>Interaction factor: 2 bit × 5 = 10 bits<br>External magnetic coefficient: 2 bits |
| Memory IF                                        | 100 MHz                                                                                                            |
| Interaction speed                                | 100 MHz                                                                                                            |
| Operating current<br>of core circuits<br>(1.1 V) | Write: 2.0 mA<br>Read: 6.0 mA<br>Interaction: 44.6 mA                                                              |



10 ms (1 000 000 steps)

25

1. Yamaoka, Masanao, et al. "A 20k-spin Ising chip to solve combinatorial optimization problems with CMOS annealing." IEEE Journal of Solid-State Circuits 51.1 (2015): 303-309.

Carnegie Mellon University CMOS Tepper School of Business

## **Complementary metal-oxide semiconductors (CMOS)**



1. Yamaoka, Masanao, et al. "A 20k-spin Ising chip to solve combinatorial optimization problems with CMOS annealing."

Carregie Verlative Masato Havashi, and Masanao Yamaoka. "An Ising computer based on simulated quantum annealing by path integral Monte Carlo method."

Teppentischer Mellon, Founder

26





#### Let's go to this interactive interface of the CMOS device from Hitachi

https://annealing-cloud.com/en/play/ising-editor.html

**Carnegie Mellon University** Tepper School of Business



## **Digital Annealers**

CMOS Implementation of Ising solution method Fully connected 1024 nodes 16-bit precision vs. 4-bit precision D-Wave

"For obtaining exact solutions of small-size problems, the SA machine called "Digital Annealer" may be the fastest so far."

2





https://arxiv.org/pdf/1806.08815.pdf 1.

https://spectrum.ieee.org/tech-talk/computing/hardware/fuiitsus-cmos-digital-annealer-produces-guantum-comput er-speeds

Carnegie Mellon University <sub>3.</sub> Goto, Hayato, Kosuke Tatsumura, and Alexander R. Dixon. "Combinatorial optimization by simulating adjabation of the second Tepper School of Busine Surcations in nonlinear Hamiltonian systems "Science advances 5.4 (2019): eaav2372.

#### **Digital Annealers**



| Alg                          | orithm 1 Simulated Annealing (SA                                                                                                           |                                                                    | Al                | gorithm 2 The Digital Annealer's Algorithm                                                                                                                      |
|------------------------------|--------------------------------------------------------------------------------------------------------------------------------------------|--------------------------------------------------------------------|-------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------|
| 1: f<br>2:<br>3:<br>4:<br>5: | or each run do<br>initialize to random initial state<br>for each temperature do<br>for each MC sweep at this tempe<br>for each variable do |                                                                    |                   | $E_{\text{offset}} \leftarrow 0$                                                                                                                                |
| 6:<br>7:                     | propose a flip                                                                                                                             | Parallel-trial<br>and effective facost acceptanc                   | -                 | <ul> <li>for each MC step (iteration) do</li> <li>if due for temperature update, update the temperature</li> <li>for each variable j, in parallel do</li> </ul> |
| 8:<br>9:                     | end for<br>end for                                                                                                                         | probability                                                        | 8:<br>9:          |                                                                                                                                                                 |
| 10:<br>11:<br>12: •          | end for                                                                                                                                    | Dynamic os                                                         | 10:<br>11:        | if at least one flip accepted then                                                                                                                              |
|                              |                                                                                                                                            | Dynamic off-set escape<br>Helps surmount short,<br>harrow barriers | 12:<br>13:<br>14: | update the state and effective fields, in parallel                                                                                                              |
|                              |                                                                                                                                            |                                                                    | 15:<br>16:<br>17: | $E_{\text{offset}} \leftarrow E_{\text{offset}} + \text{offset\_increase\_rate}$                                                                                |
|                              |                                                                                                                                            |                                                                    | 18:               | end for<br>end for                                                                                                                                              |

#### 1. <u>https://arxiv.org/pdf/1806.08815.pdf</u>

2. https://spectrum.ieee.org/tech-talk/computing/hardware/fuji tsus-cmos-digital-annealer-produces-quantum-computer-s

29

William Larimer Mellon, Founder

**Carnegie Mellon University** Tepper School of Business

## **Parallel Tempering**



| Algorithm 1 Simulated Annealing (SA) |                                                    | Algorithm 3 Parallel Tempering with Isoenergetic Clu<br>(PT+ICM)                         | ister Moves |  |  |  |  |
|--------------------------------------|----------------------------------------------------|------------------------------------------------------------------------------------------|-------------|--|--|--|--|
| 1: 1                                 | for each run do                                    | 1: initialize all replicas with random initial states                                    |             |  |  |  |  |
| 2:                                   | initialize to random initial state                 | 2: for each MC sweep do                                                                  |             |  |  |  |  |
| 3:                                   | for each temperature do                            | 3: for each replica, for each variable do                                                |             |  |  |  |  |
| 4:                                   | for each MC sweep at this temperature do           | 4: propose a flip                                                                        |             |  |  |  |  |
| 5:                                   | for each variable do                               | 5: if accepted, update the state and effective fields                                    |             |  |  |  |  |
| 6:                                   | propose a flip                                     | 6: end for                                                                               |             |  |  |  |  |
| 7:                                   | if accepted, update the state and effective fields | 7: for each pair of sequential replicas do                                               |             |  |  |  |  |
| 8:                                   | end for                                            | 8: propose a replica exchange                                                            |             |  |  |  |  |
| 9:                                   | end for                                            | 9: if accepted, swap the temperatures between the replicas                               |             |  |  |  |  |
| 10:                                  | update the temperature                             | 10: end for                                                                              |             |  |  |  |  |
| 11:                                  | end for                                            | 11: perform ICM update, swapping the states of a cluster of variables that               |             |  |  |  |  |
| 12:                                  | end for                                            | have opposite states in the two replicas; update the states and fields for both replicas |             |  |  |  |  |
|                                      |                                                    | 12: end for                                                                              |             |  |  |  |  |

- Instead of having a single state you have several replicas
- Then the flips can be done among replicas
- It can be implemented in the Digital Annealer
- Additionally: There can be cluster updates (flip more than one spin if they are "connected")
  - Similar to Anil's intuition on the Swendsen-Wang Algorithm

#### Carnegie Mellon University

- 1. <u>https://arxiv.org/pdf/1806.08815.pdf</u>
- 2. https://spectrum.ieee.org/tech-talk/computing/hardware/fuji tsus-cmos-digital-annealer-produces-quantum-computer-s

Tepper School of Business

William Larimer Mellon, Founder

30

## **Digital Annealing v Simulated Annealing v Parallel Tempering**

#### **Fully connected instances**



**Digital Annealer Wins** 



Sparse instances

- DA Digital Annealer
- SA Simulated Annealing
- PT(+ICM) Parallel Tempering (+Isoenergetic Cluster Moves)
- PTDA Parallel Tempering Digital Annealer

#### **Parallel Tempering Wins**

- 1. https://arxiv.org/pdf/1806.08815.pdf
- S. V. Isakov, I. N. Zintchenko, T. F. Rønnow, and M. Troyer, Optimized simulated annealing for Ising spin glasses, Comput. Phys. Commun. 192, 265 (2015)

#### Carnegie Mellon University Tepper School of Business

William Larimer Mellon, Founder

31 CAPD







#### Quantum Computing Challenge Series

| СН  | Quantum Computing Challenge Series - Max Cut Marathon Match      | \$11,500 |
|-----|------------------------------------------------------------------|----------|
| тсо | Ended Apr 04 Marathon Match                                      | Purse    |
| CH  | Quantum Computing Learning Challenge #3 - Max Cut                | \$250    |
| TCO | Ended Aug 04 Python Data Science Other                           | Purse    |
| СН  | Quantum Computing Learning Challenge 2 - Scheduling              | \$250    |
| тсо | Ended Feb 28 Python Data Science Other                           | Purse    |
| СН  | Quantum Computing Learning Challenge #1 - Solve Sudoku Instantly | \$250    |
| TCO | Ended Feb 14 Algorithm Python Data Science +1                    | Purse    |



**Carnegie Mellon University** Tepper School of Business



- 1. <u>https://arxiv.org/pdf/1806.08815.pdf</u>
- 2. <u>https://spectrum.ieee.org/tech-talk/computing/hardware/fuji</u> <u>tsus-cmos-digital-annealer-produces-quantum-computer-s</u> <u>peeds</u>
- 3. https://tc3-japan.github.io/DA\_tutorial/index.html *William Larimer Mellon, Founder*



## Digital Annealer v Application-specific integrated circuit v FPGA v GPU

|                                                                           | Fujitsu Digital<br>Annealer [25] | Hitachi CMOS<br>Annealer [15], [26]                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 | Hitachi CMOS<br>Annealer [16], [26] | FPGA accelerator                                           |
|---------------------------------------------------------------------------|----------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-------------------------------------|------------------------------------------------------------|
| Maximum number of spins                                                   | 8192                             | 61,952                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              | 6,400                               | 32,768                                                     |
| Type of coupling                                                          | Total coupling                   | King graph                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          | King graph                          | Total Coupling                                             |
| Number of couplings                                                       | 67 million                       | 0.37 million                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        | 0.4 million                         | 1 billion                                                  |
| Computation                                                               | 64-bit fixed-point               | not mentioned                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       | 8-bit fixed point                   | 32-bit floating-point                                      |
| Implementation                                                            | ASIC                             | ASIC                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                | FPGA                                | 2-FPGA connected via fibe                                  |
| Category                                                                  | FP                               |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     |                                     | accelerator                                                |
| C ·                                                                       | <b>FD</b>                        | and the second se |                                     |                                                            |
|                                                                           | FP                               |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     |                                     |                                                            |
| Speed-up                                                                  | FP                               |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     | ler                                 | nentation                                                  |
| Speed-up<br>Accuracy                                                      | FP                               |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     | ler                                 | nentation                                                  |
| Speed-up<br>Accuracy<br>Problem size                                      | FP                               |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     | ati                                 | nentation<br>on                                            |
| Speed-up<br>Accuracy                                                      | FP                               |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     | ati                                 | nentation                                                  |
| Speed-up<br>Accuracy<br>Problem size                                      | FP                               |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     | ely                                 | nentation<br>on                                            |
| Speed-upAccuracyProblem sizePower consumption                             | FP                               |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     | ely                                 | nentation<br>on<br>large                                   |
| Speed-upAccuracyProblem sizePower consumptionPower-efficiencyAvailability | Requi                            |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     | ely<br>m                            | nentation<br>on<br>large<br>small<br>PCs to supercomputers |
| Speed-upAccuracyProblem sizePower consumptionPower-efficiencyAvailability |                                  |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     | ely<br>ely<br>m D<br>A,             | nentation<br>on<br>large<br>small                          |

1. Waidyasooriya, Hasitha Muthumala, and Masanori Hariyama. "A GPU-Based Quantum Annealing Simulator for Fully-Connected Ising Models Utilizing Spatial and Temporal Parallelism."

Waidyasooriya, H.M., Hariyama, M., Miyama, M.J. et al. OpenCL-based design of an FPGA accelerator for quantum annealing

#### Carnegie Mellon University mulation.

Tepper School of Bushleads S William Larimer Mellon, Founder



## **Alternatives available**

|                                                | Fixstars<br>Optigan     | D-Wave<br>2000Q           | Hitachi<br>CMOS<br>Annealing | Fujitsu<br>Digital<br>Annealer | Toshiba<br>SBM     |
|------------------------------------------------|-------------------------|---------------------------|------------------------------|--------------------------------|--------------------|
| Calculation method                             | GPU                     | Quantum<br>annealing      | Digital<br>circuit           | Digital<br>circuit             | GPU                |
| Maximum number of bits                         | Over<br>100,000         | 2,048<br>(16x16x8)        | 61,952<br>(352x176)          | 1,024/<br>8,192                | 10,000             |
| Coefficient parameter                          | Digital (32<br>/ 64bit) | Analog<br>(about<br>5bit) | Digital (3bit)               | Digital<br>(16/64<br>bit)      | Digital<br>(32bit) |
| Combined graph                                 | Fully<br>combined       | Chimera<br>graph          | King Graph                   | Fully combined                 | Fully<br>combined  |
| Total number of<br>combined conversion<br>bits | 65,536                  | 64                        | 176                          | 1,024/<br>8,192                | 1,000              |
| API endpoint                                   | Fixstars                | D-Wave<br>Cloud           | Annealing<br>Cloud Web       | DA Cloud                       | AWS                |

**Carnegie Mellon University** Tepper School of Business





## **Playing with Fixstars's Optigan**

#### Let's go to this interactive interface of the GPU implementation from Fixstars Optigan

https://quantum.fixstars.com/product/demo#demoModal

Translated version (but you cannot run it)

https://colab.research.google.com/github/bernalde/QuIP/blob/master/noteb ooks/Notebook%208%20-%20Amplify%20Tutorials.ipynb

**Carnegie Mellon University** Tepper School of Business







- Non-linear Integer Programs model a variety of real world problems from many domains
- Solving them classically has limitations, especially with non-convex objectives, constrained integer variables
- We explored non-classical approaches based on QUBO/Ising
- GAMA is a general purpose heuristic that utilizes Graver Test-Set
- There are a variety of options to solve Ising model

Carnegie Mellon UniversityTepper School of Businesswill

